VLDeformer: Vision–Language Decomposed Transformer for fast cross-modal retrieval

Authors

Abstract

Cross-modal retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, with the powerful representation of pairwise text–image inputs via early interaction, the accuracy of vision–language (VL) transformers has outperformed existing retrieval methods. However, when the same paradigm is used for inference, the efficiency of VL transformers is still too low to be applied in a real cross-modal SE. Inspired by the mechanism by which humans learn and then use knowledge, this paper presents a novel Vision–Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. In the proposed method, cross-modal retrieval is separated into two stages: the VL transformer learning stage and the VL decomposition stage. The latter stage plays the role of single-modal indexing, to some extent like term indexing in a text SE: the model learns cross-modal knowledge from early-interaction pre-training and is then decomposed into individual encoders. The decomposition requires only small target datasets for supervision and achieves both 1000+ times acceleration and less than a 0.6% average recall drop. VLDeformer also outperforms state-of-the-art visual-semantic embedding methods on COCO and Flickr30k.
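The speedup described in the abstract comes from the two-stage structure: after decomposition, each modality is encoded independently, so all gallery images can be embedded once offline and a text query needs only one encoding plus a similarity lookup, instead of running the full VL transformer over every text–image pair. A minimal sketch of this retrieval pattern is below; the encoders here are hypothetical random-projection stubs standing in for the decomposed networks, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 64

# Hypothetical stand-ins for the decomposed single-modal encoders:
# random projections followed by L2 normalization.
W_img = rng.standard_normal((128, EMB_DIM))
W_txt = rng.standard_normal((128, EMB_DIM))

def encode_image(image_feat: np.ndarray) -> np.ndarray:
    """Embed an image feature vector into the shared space (stub)."""
    v = image_feat @ W_img
    return v / np.linalg.norm(v)

def encode_text(text_feat: np.ndarray) -> np.ndarray:
    """Embed a text feature vector into the shared space (stub)."""
    v = text_feat @ W_txt
    return v / np.linalg.norm(v)

# Offline stage: pre-compute and index all gallery image embeddings once,
# analogous to term indexing in a text search engine.
gallery = rng.standard_normal((1000, 128))
index = np.stack([encode_image(g) for g in gallery])  # shape (1000, EMB_DIM)

def retrieve(text_feat: np.ndarray, k: int = 5) -> np.ndarray:
    """Online stage: one text encoding plus one matrix product,
    rather than a full cross-modal forward pass per candidate pair."""
    q = encode_text(text_feat)
    scores = index @ q              # cosine similarity (unit-norm vectors)
    return np.argsort(-scores)[:k]  # indices of the top-k gallery images

top5 = retrieve(rng.standard_normal(128))
```

Because the gallery embeddings are fixed vectors, the online cost scales with one encoder call plus a dot-product scan (or an approximate nearest-neighbor index), which is the source of the reported 1000+ times acceleration over pairwise early-interaction scoring.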


Similar articles

Cross-Modal Manifold Learning for Cross-modal Retrieval

This paper presents a new scalable algorithm for cross-modal similarity preserving retrieval in a learnt manifold space. Unlike existing approaches that compromise between preserving global and local geometries, the proposed technique respects both simultaneously during manifold alignment. The global topologies are maintained by recovering underlying mapping functions in the joint manifold spac...


MHTN: Modal-adversarial Hybrid Transfer Network for Cross-modal Retrieval

Cross-modal retrieval has drawn wide interest for retrieval across different modalities of data (such as text, image, video, audio and 3D model). However, existing methods based on deep neural network (DNN) often face the challenge of insufficient cross-modal training data, which limits the training effectiveness and easily leads to overfitting. Transfer learning is usually adopted for relievin...


A Comprehensive Survey on Cross-modal Retrieval

In recent years, cross-modal retrieval has drawn much attention due to the rapid growth of multimodal data. It takes one type of data as the query to retrieve relevant data of another type. For example, a user can use a text to retrieve relevant pictures or videos. Since the query and its retrieved results can be of different modalities, how to measure the content similarity between different m...


Cross-Modal Retrieval: A Pairwise Classification Approach

Content is increasingly available in multiple modalities (such as images, text, and video), each of which provides a different representation of some entity. The cross-modal retrieval problem is: given the representation of an entity in one modality, find its best representation in all other modalities. We propose a novel approach to this problem based on pairwise classification. The approach s...


Heterogeneous Metric Learning for Cross-Modal Multimedia Retrieval

Due to the massive explosion of multimedia content on the web, users demand a new type of information retrieval, called cross-modal multimedia retrieval where users submit queries of one media type and get results of various other media types. Performing effective retrieval of heterogeneous multimedia content brings new challenges. One essential aspect of these challenges is to learn a heteroge...



Journal

Journal title: Knowledge Based Systems

Year: 2022

ISSN: 1872-7409, 0950-7051

DOI: https://doi.org/10.1016/j.knosys.2022.109316